Skip to content

feat: add ResourceRegistrar wiring, plugin-data mount, and upload auth#54

Merged
kaiitunnz merged 12 commits into
mainfrom
kaiitunnz/feat/plugin-data-dir
May 28, 2026
Merged

feat: add ResourceRegistrar wiring, plugin-data mount, and upload auth#54
kaiitunnz merged 12 commits into
mainfrom
kaiitunnz/feat/plugin-data-dir

Conversation

@kaiitunnz
Copy link
Copy Markdown
Collaborator

@kaiitunnz kaiitunnz commented May 20, 2026

Purpose

FLOWMESH_PLUGIN_DIR is read-only — it's the code mount — so plugins that need writable persistence (a SQLite ACL, a cache, an ownership table) have nowhere to write. This adds a second, writable mount at /app/plugin-data, driven by a single FLOWMESH_PLUGIN_DATA_DIR env var. The immediate consumer is lumid-flowmesh-plugin's ResourceRegistrar / PermissionChecker ACL.

It also closes an authorization gap on the result and trace upload endpoints: they authenticated the caller but ran no authorization, so any valid token could overwrite another task's results, artifacts, or trace files. They now require a kind-level RESULT/WRITE permission. This is a no-op when no PermissionChecker is registered, so default behavior is unchanged; it pairs with the flowmesh:results:write scope in the matching lumid-flowmesh-plugin PR.

Changes

  • Plugin-data mountcli/stack/.../assets/compose.yml, env_schema.py, .env.example, utils.py, stack.py: add the /app/plugin-data mount and route FLOWMESH_PLUGIN_DATA_DIR to either a host bind or an external Docker volume.
  • Upload authorizationsrc/server/routers/v1/results.py, traces.py: gate ingest_result, upload_result_file, and upload_task_trace on kind-level RESULT/WRITE, with deny-path tests.
  • Docsdocs/ENV.md, docs/PLUGINS.md, hook/README.md: document the three mount modes; fold in a pre-existing HookBindings/BaseBindings correction.

Design

FLOWMESH_PLUGIN_DATA_DIR is one env var with three modes, discriminated by the same rule docker compose uses for short-syntax mount sources (in apply_plugin_data_env): a path (relative resolves against the operator's cwd, or absolute) becomes a host bind auto-created on stack up; a bare name becomes an external volume the operator precreates with docker volume create. Plugin code stays read-only; plugin data is writable and separately mounted.

The upload gate is kind-level (RESULT, None, WRITE) rather than per-task because the uploader is the worker/operator key, which holds no per-task grant — authorization is by scope, not ownership.

Test Plan

uv run pre-commit run --all-files
uv run scripts/dev/check_env_examples.py
uv run pytest tests/ --ignore=tests/worker/test_mp_executor_cleanup_gpu.py
  • docker compose config across all three FLOWMESH_PLUGIN_DATA_DIR modes (empty/path default, custom path, bare volume name).
  • Local two-part stack end-to-end (flowmesh stack up + CPU worker) against a locally built image carrying this branch, using a logging-only dummy plugin to observe FlowMesh's hook wiring against the new mount.

Test Result

$ uv run pre-commit run --all-files          # all checks passed
$ uv run scripts/dev/check_env_examples.py   # clean
$ uv run pytest tests/ ...                    # all tests passed

docker compose config produced the expected bind / volume shapes for all three modes.

End-to-end (passed):

  • FlowMesh fires reconcile_resources once at boot with the full live enumeration, and register / require_permission / resolve_accessible_ids at the expected endpoints.
  • The plugin's SQLite ACL is created under /app/plugin-data, survives a stack down / stack up, and the startup reconcile correctly refreshes live grants and prunes stale ones.
  • The result/trace upload gate's allow- and deny-paths are covered by unit tests; the local-executor workflow used here writes results to the shared results dir and does not exercise the HTTP upload endpoints.

Pre-submission Checklist
  • I have read the contribution guidelines.
  • I have run pre-commit run --all-files and fixed any issues.
  • I have added or updated tests covering my changes (if applicable).
  • I have verified that uv run pytest tests/ passes locally.
  • If I changed shared schemas or proto definitions, I have checked downstream compatibility across Server and Worker.
  • If I changed the SDK or CLI, I have verified the affected packages work (uv sync --all-packages --group ci --frozen).
  • If this is a breaking change, I have prefixed the PR title with [BREAKING] and described migration steps above.
  • I have updated documentation or config examples if user-facing behavior changed.

@kaiitunnz kaiitunnz changed the title feat(stack): add FLOWMESH_PLUGIN_DATA_DIR for plugin writable storage feat: add FLOWMESH_PLUGIN_DATA_DIR for plugin writable storage May 20, 2026
@kaiitunnz kaiitunnz force-pushed the kaiitunnz/feat/plugin-data-dir branch 4 times, most recently from 3c6fcd7 to e4a7b74 Compare May 22, 2026 14:45
kaiitunnz and others added 10 commits May 26, 2026 17:43
FLOWMESH_PLUGIN_DIR is deliberately read-only — it's the code mount —
so plugins that need persistence (a SQLite ACL, a cache file) currently
have to ask each operator to add a custom bind via
docker-compose.override.yml, which is brittle.

Add a second mount at /app/plugin-data driven by FLOWMESH_PLUGIN_DATA_DIR.
A path value (the default, `./plugin-data`) is a host bind, auto-created
on stack up; a bare name is an external Docker volume of that name,
which the operator precreates and owns the lifecycle of. The CLI
discriminates path vs bare name in apply_plugin_data_env and routes the
value to either the service mount directly (path mode) or an internal
FLOWMESH_PLUGIN_DATA_VOLUME parameter consumed by a parameterized
external volume declaration (volume mode).

Signed-off-by: Noppanat Wadlom <[email protected]>
The default external volume name was the bare alias
`flowmesh_plugin_data`, which would collide across multiple stacks
sharing a host (FLOWMESH_STACK_SUFFIX exists exactly to scope these
names per-stack). Match the convention used by the other named volumes
in the same file.

Signed-off-by: Noppanat Wadlom <[email protected]>
`apply_plugin_data_env` was hooked into `_stack()._load`, which runs on
every stack subcommand (`up`, `down`, `ps`, `logs`, `clean`, ...) via
DockerComposeStack.run, so `ensure_dir(resolved)` materialized
./plugin-data in the operator's cwd even on read-only commands.
FLOWMESH_PLUGIN_DIR avoids this by routing its mkdir through
`ensure_deploy_paths`, which only runs when `to_deploy=True`.

Split the helper: env-var routing (path resolution + alias re-route)
stays in `apply_plugin_data_env` so compose substitution always has
the right values; the `ensure_dir` call moves into
`ensure_deploy_paths`, gated on `FLOWMESH_PLUGIN_DATA_VOLUME` being
unset (volume mode is operator-managed, no host dir to create).

Also drops the redundant "~/" prefix from _PLUGIN_DATA_PATH_PREFIXES
(the trailing "~" already subsumes it) and adds five unit tests
covering the empty / relative / absolute / tilde / bare-name branches.

Signed-off-by: Noppanat Wadlom <[email protected]>
When the host restarts past a ResourceRegistrar's TTL, persistent
plugins were silently dropping grants on long-lived workers and
workflows. The reconcile sweep replaces TTL pruning with a stronger
guarantee: on boot, enumerate every live resource (workflows + their
tasks, nodes, workers) and hand the full batch to each registrar's
`refresh`, then call `purge_stale` once per registrar. Registrars
that persist state use the pair to drop records the server no longer
tracks.

- `auth.security.refresh_resources(refs, logger)` and
  `purge_stale_resources(logger)` fan out to every
  `ResourceRegistrar`.
- `_lifespan` runs `_reconcile_resources()` once after plugins load
  and the system principal resolves. Worker / workflow enumeration is
  gated on `IS_ROOT_NODE`; nodes are always present.
- docs/PLUGINS.md notes the new sweep on the `ResourceRegistrar`
  bullet.

Requires lumid-hooks 0.2.0 for the new Protocol methods. The
`tool.uv.sources` entry pointing at `../lumid.hooks` is temporary —
drop it (and refresh the pinned requirements files) once 0.2.0 is on
PyPI.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Noppanat Wadlom <[email protected]>
Review findings on the startup-reconcile work:

- Move `_reconcile_resources()` after `SUPERVISOR.start(...)` so this
  node is in `NODE_REGISTRY` by the time the sweep enumerates it.
  Previously the local node was systematically missing from the refresh
  batch.
- `refresh_resources` now returns the set of registrar names whose
  `refresh` raised, and `purge_stale_resources(..., skip=...)` excludes
  them. A registrar that fails mid-sweep no longer wipes rows it never
  marked live; other registrars still complete normally.
- `docs/PLUGINS.md` locates the sweep relative to plugin install /
  supervisor handshake and documents the host-side guard.
- Style/ops: `TODO(lumid-hooks-0.2.0):` prefix on the `tool.uv.sources`
  comment so future cleanup surfaces in grep; `uv.lock` revision
  restored to 3 (was downgraded by an older local uv).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Noppanat Wadlom <[email protected]>
Match lumid-hooks 0.2.0's single-method Protocol. The fan-out helper
in `auth.security` becomes one call; the failed-set plumbing in
`_lifespan` goes away because `reconcile` is atomic on the plugin
side — a raised exception leaves the registrar's store unchanged, so
the host can log and move on.

- `auth.security.reconcile_resources(resources, logger)` replaces
  `refresh_resources` / `purge_stale_resources`. Each registrar's
  `reconcile` is wrapped in try/except + `logger.exception`; the sweep
  continues with the next registrar on failure.
- `_lifespan` calls `reconcile_resources(refs, logger)` once and
  drops the failed-set threading.
- Tests trade the old refresh/purge dispatch tests for one happy-path
  fan-out test and one that asserts a failing registrar is isolated
  while the rest still run.
- docs/PLUGINS.md reflects the single-call shape and the
  atomic-or-rollback semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Noppanat Wadlom <[email protected]>
lumid-hooks 0.2.0 (with the new `ResourceRegistrar.reconcile`) is
released on PyPI, so the editable path override from
`[tool.uv.sources]` is dropped. flowmesh-hook's pin moves to >=0.2.0
since it re-exports the breaking-changed Protocol surface. Server
requirements regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Noppanat Wadlom <[email protected]>
`_reconcile_resources` walked `NODE_REGISTRY` and `WORKER_REGISTRY`
but read `.node_id` / `.worker_id` off the records — those attributes
don't exist on the `Node` / `Worker` Pydantic models, both of which
expose `.id`. The mismatch crashed the FastAPI lifespan during the
startup reconcile sweep, taking the whole server down on every boot.

Caught while running the lumid plugin e2e plan against a freshly
built `FLOWMESH_VERSION=permission` image.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Noppanat Wadlom <[email protected]>
Signed-off-by: Noppanat Wadlom <[email protected]>
@kaiitunnz kaiitunnz force-pushed the kaiitunnz/feat/plugin-data-dir branch from 3d73a65 to 446730f Compare May 26, 2026 09:45
kaiitunnz added 2 commits May 27, 2026 23:01
`ingest_result`, `upload_result_file`, and `upload_task_trace`
authenticated the caller but ran no authorization, so any valid token
could write or overwrite a task's result, artifacts, or trace files.

Gate all three with a kind-level `require_permission(RESULT, None,
WRITE)`. The check is kind-level because the uploader is an
operator/worker key that holds no per-task grant; a concrete check
would resolve to the owning task's grant and deny legitimate uploads.
A no-op when no PermissionChecker is registered, so default behavior
is unchanged. Adds deny-path tests for each endpoint.

Signed-off-by: Noppanat Wadlom <[email protected]>
Drop the "(registrar plugins)" parenthetical; the lines below already
explain why the sweep runs after the supervisor handshake.

Signed-off-by: Noppanat Wadlom <[email protected]>
@kaiitunnz kaiitunnz requested a review from timzsu May 27, 2026 19:48
@kaiitunnz kaiitunnz marked this pull request as ready for review May 27, 2026 19:48
Copy link
Copy Markdown
Collaborator

@timzsu timzsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@kaiitunnz kaiitunnz changed the title feat: add FLOWMESH_PLUGIN_DATA_DIR for plugin writable storage feat: add ResourceRegistrar wiring, plugin-data mount, and upload auth May 28, 2026
@kaiitunnz kaiitunnz merged commit a0b49ed into main May 28, 2026
14 checks passed
@kaiitunnz kaiitunnz deleted the kaiitunnz/feat/plugin-data-dir branch May 28, 2026 03:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants